Enhancing Motion Deblurring in High-Speed Scenes with Spike Streams

Neural Information Processing Systems

Traditional cameras produce desirable vision results but struggle with motion blur in high-speed scenes due to long exposure windows. Existing frame-based deblurring algorithms face challenges in extracting useful motion cues from severely blurred images. Recently, an emerging bio-inspired vision sensor known as the spike camera has achieved an extremely high frame rate while preserving rich spatial details, owing to its novel sampling mechanism. However, typical binary spike streams are relatively low-resolution, degraded image signals devoid of color information, making them unfriendly to human vision. In this paper, we propose a novel approach that integrates the two modalities via a two-branch architecture, leveraging spike streams as auxiliary visual cues for guiding deblurring in high-speed motion scenes. We propose the first spike-based motion deblurring model with bidirectional information complementarity. We introduce a content-aware motion magnitude attention module that utilizes a learnable mask to extract relevant information from blurry images effectively, and we incorporate a transposed cross-attention fusion module to efficiently combine features from both spike data and blurry RGB images. Furthermore, we build two extensive synthesized datasets for training and validation purposes, encompassing high-temporal-resolution spikes, blurry images, and corresponding sharp images. The experimental results demonstrate that our method effectively recovers clear RGB images from highly blurry scenes and outperforms state-of-the-art deblurring algorithms in multiple settings.
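The transposed cross-attention idea mentioned above can be sketched minimally: attention is computed across channels rather than spatial tokens, so the attention matrix is C-by-C and cost stays linear in image size. The function below is a hypothetical single-head NumPy sketch without learned projections, not the paper's exact module.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def transposed_cross_attention(q_feat, kv_feat, eps=1e-6):
    # q_feat, kv_feat: (C, H, W) feature maps from the RGB and spike branches
    c, h, w = q_feat.shape
    q = q_feat.reshape(c, h * w)
    k = kv_feat.reshape(c, h * w)
    v = kv_feat.reshape(c, h * w)
    # L2-normalize each channel's token vector so logits are cosine similarities
    q = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    attn = _softmax(q @ k.T)            # (C, C): channel-to-channel attention
    return (attn @ v).reshape(c, h, w)  # fused features, same shape as input
```

Because each row of the attention matrix sums to one, a constant-valued key/value map passes through unchanged, which is a convenient sanity check for the implementation.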


Training an Open-Vocabulary Monocular 3D Detection Model without 3D Data

Neural Information Processing Systems

Open-vocabulary 3D object detection, which aims to effectively recognize novel classes in previously unseen domains, has recently attracted considerable attention due to its broad applications in autonomous driving and robotics. However, existing point cloud-based open-vocabulary 3D detection models are limited by their high deployment costs. In this work, we propose a novel open-vocabulary monocular 3D object detection framework, dubbed OVM3D-Det, which trains detectors using only RGB images, making it both cost-effective and scalable to publicly available data. Unlike traditional methods, OVM3D-Det does not require high-precision LiDAR or 3D sensor data for either input or generating 3D bounding boxes. Instead, it employs open-vocabulary 2D models and pseudo-LiDAR to automatically label 3D objects in RGB images, fostering the learning of open-vocabulary monocular 3D detectors. However, training 3D models with labels directly derived from pseudo-LiDAR is inadequate due to imprecise boxes estimated from noisy point clouds and severely occluded objects.


End-to-end Multi-modal Video Temporal Grounding

Neural Information Processing Systems

We address the problem of text-guided video temporal grounding, which aims to identify the time interval of a certain event based on a natural language description. Different from most existing methods that only consider RGB images as visual features, we propose a multi-modal framework to extract complementary information from videos. Specifically, we adopt RGB images for appearance, optical flow for motion, and depth maps for image structure. While RGB images provide abundant visual cues of certain events, the performance may be affected by background clutter. Therefore, we use optical flow to focus on large motion and depth maps to infer the scene configuration when the action is related to objects recognizable with their shapes. To integrate the three modalities more effectively and enable inter-modal learning, we design a dynamic fusion scheme with transformers to model the interactions between modalities. Furthermore, we apply intra-modal self-supervised learning to enhance feature representations across videos for each modality, which also facilitates multi-modal learning. We conduct extensive experiments on the Charades-STA and ActivityNet Captions datasets, and show that the proposed method performs favorably against state-of-the-art approaches.
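As a rough illustration of transformer-style inter-modality fusion, one can run single-head self-attention over the three modality tokens of a clip, letting each modality attend to the others before pooling. The projection matrices `wq`, `wk`, `wv` and the final averaging are assumptions for the sketch, not the paper's actual design.

```python
import numpy as np

def _softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_modalities(rgb, flow, depth, wq, wk, wv):
    # rgb, flow, depth: (D,) per-clip features; wq/wk/wv: (D, D) projections
    tokens = np.stack([rgb, flow, depth], axis=0)    # (3, D) modality tokens
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    attn = _softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (3, 3) modality mixing
    return (attn @ v).mean(axis=0)                   # fused (D,) feature

d = 8
rng = np.random.default_rng(1)
wq, wk, wv = (rng.standard_normal((d, d)) for _ in range(3))
fused = fuse_modalities(rng.standard_normal(d), rng.standard_normal(d),
                        rng.standard_normal(d), wq, wk, wv)
```

The (3, 3) attention map makes the modality interactions explicit: its rows show, per modality, how much weight is placed on appearance, motion, and structure for the given clip.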


OSMO: Open-Source Tactile Glove for Human-to-Robot Skill Transfer

Yin, Jessica, Qi, Haozhi, Wi, Youngsun, Kundu, Sayantan, Lambeta, Mike, Yang, William, Wang, Changhao, Wu, Tingfan, Malik, Jitendra, Hellebrekers, Tess

arXiv.org Artificial Intelligence

Abstract-- Human video demonstrations provide abundant training data for learning robot policies, but video alone cannot capture the rich contact signals critical for mastering manipulation. We introduce OSMO, an open-source wearable tactile glove designed for human-to-robot skill transfer . The glove features 12 three-axis tactile sensors across the fingertips and palm and is designed to be compatible with state-of-the-art hand-tracking methods for in-the-wild data collection. We demonstrate that a robot policy trained exclusively on human demonstrations collected with OSMO, without any real robot data, is capable of executing a challenging contact-rich manipulation task. On a real-world wiping task requiring sustained contact pressure, our tactile-aware policy achieves a 72% success rate, outperforming vision-only baselines by eliminating contact-related failure modes. We release complete hardware designs, firmware, and assembly instructions to support community adoption. Tactile sensing enables humans to excel at manipulation by providing real-time feedback about contact forces that vision alone cannot capture. Consider trying to dice a carrot from video alone; one cannot observe the nuanced force control that makes the task successful. Many different applied forces can result in nearly identical visual appearances, leaving critical information about force control invisible to vision.


SwarmDiffusion: End-To-End Traversability-Guided Diffusion for Embodiment-Agnostic Navigation of Heterogeneous Robots

Zhura, Iana, Karaf, Sausar, Batool, Faryal, Mudalige, Nipun Dhananjaya Weerakkodi, Serpiva, Valerii, Abdulkarim, Ali Alridha, Fedoseev, Aleksey, Seyidov, Didar, Amjad, Hajira, Tsetserukou, Dzmitry

arXiv.org Artificial Intelligence

Abstract--Visual traversability estimation is critical for autonomous navigation, but existing VLM-based methods rely on hand-crafted prompts, generalize poorly across embodiments, and output only traversability maps, leaving trajectory generation to slow external planners. We propose SwarmDiffusion, a lightweight end-to-end diffusion model that jointly predicts traversability and generates a feasible trajectory from a single RGB image. To remove the need for annotated or planner-produced paths, we introduce a planner-free trajectory construction pipeline based on randomized way-point sampling, Bézier smoothing, and regularization enforcing connectivity, safety, directionality, and path thinness. This enables learning stable motion priors without demonstrations. SwarmDiffusion leverages VLM-derived supervision without prompt engineering and conditions the diffusion process on a compact embodiment state, producing physically consistent, traversable paths that transfer across different robot platforms. Across indoor environments and two embodiments (quadruped and aerial), the method achieves 80-100% navigation success and 0.09s inference, and adapts to a new robot using only 500 additional visual samples. Reliable indoor navigation is fundamental to a wide range of robotic applications, including warehouse automation [1], industrial inspection [2], search and rescue, and autonomous logistics. In these settings, robots must continuously reason about where they can safely move and how to plan a feasible trajectory through cluttered, unstructured, and dynamic spaces.
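The planner-free construction described above, randomized waypoints smoothed into a Bézier curve, can be sketched with De Casteljau's algorithm; the sampling range and the number of intermediate waypoints below are illustrative assumptions, and the paper's regularization terms are omitted.

```python
import numpy as np

def bezier(points, n=50):
    # Evaluate a Bézier curve via De Casteljau's algorithm.
    # points: (K, 2) control points; returns (n, 2) curve samples.
    samples = []
    for t in np.linspace(0.0, 1.0, n):
        p = points.astype(float).copy()
        while len(p) > 1:
            p = (1.0 - t) * p[:-1] + t * p[1:]  # repeated linear interpolation
        samples.append(p[0])
    return np.array(samples)

rng = np.random.default_rng(0)
start, goal = np.array([0.0, 0.0]), np.array([1.0, 1.0])
# Randomized intermediate waypoints between start and goal (hypothetical count)
waypoints = np.vstack([start, rng.uniform(0.0, 1.0, size=(3, 2)), goal])
path = bezier(waypoints)
```

A Bézier curve interpolates its first and last control points, so the smoothed path is guaranteed to begin at the start pose and end at the goal regardless of how the intermediate waypoints are sampled.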


FOM-Nav: Frontier-Object Maps for Object Goal Navigation

Chabal, Thomas, Chen, Shizhe, Ponce, Jean, Schmid, Cordelia

arXiv.org Artificial Intelligence

Abstract-- This paper addresses the Object Goal Navigation problem, where a robot must efficiently find a target object in an unknown environment. Existing implicit memory-based methods struggle with long-term memory retention and planning, while explicit map-based approaches lack rich semantic information. To address these challenges, we propose FOM-Nav, a modular framework that enhances exploration efficiency through Frontier-Object Maps and vision-language models. Our Frontier-Object Maps are built online and jointly encode spatial frontiers and fine-grained object information. Using this representation, a vision-language model performs multimodal scene understanding and high-level goal prediction, which is executed by a low-level planner for efficient trajectory generation. To train FOM-Nav, we automatically construct large-scale navigation datasets from real-world scanned environments. Extensive experiments validate the effectiveness of our model design and constructed dataset. FOM-Nav achieves state-of-the-art performance on the MP3D and HM3D benchmarks, particularly in navigation efficiency metric SPL, and yields promising results on a real robot. Autonomous navigation has been a long-standing challenge in robotics [1], dating back to the pioneering work on the robot Shakey [2] in the 1960s. While early work focused on navigating to specific points [3], [4] with a preconstructed map [5], [6], recent research has progressively shifted towards navigation in unknown environments using textual [7], [8] or visual [9] goals, which is an essential capability for enabling mobile manipulation systems [10], [11] to perform diverse real-world tasks. In this work, we focus on the object goal navigation task (ObjectNav) [8], where an agent must navigate to a target object category in an unknown environment using RGB-D observations. This task requires long-horizon multimodal scene understanding and efficient exploration.
The robot should not only recognize objects within its current field of view but also use previous observations to develop more accurate scene understanding.
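Frontier extraction from an occupancy grid is the standard ingredient underlying frontier-based maps: a frontier is a free cell adjacent to unexplored space. The minimal sketch below shows only this geometric step and omits the object annotations that FOM-Nav attaches to its maps.

```python
import numpy as np

FREE, OCCUPIED, UNKNOWN = 0, 1, 2

def frontier_cells(grid):
    # grid: (H, W) occupancy grid of FREE / OCCUPIED / UNKNOWN cells.
    # A frontier cell is FREE and has at least one 4-connected UNKNOWN neighbor.
    h, w = grid.shape
    frontier = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            if grid[y, x] != FREE:
                continue
            for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and grid[ny, nx] == UNKNOWN:
                    frontier[y, x] = True
                    break
    return frontier
```

Exploration policies then score these frontier cells (in FOM-Nav's case, jointly with nearby object evidence) to pick the next high-level goal.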


MonoDream: Monocular Vision-Language Navigation with Panoramic Dreaming

Wang, Shuo, Wang, Yongcai, Fan, Zhaoxin, Wang, Yucheng, Chen, Maiyue, Wang, Kaihui, Su, Zhizhong, Li, Wanting, Cai, Xudong, Jin, Yeying, Li, Deying

arXiv.org Artificial Intelligence

Vision-Language Navigation (VLN) tasks often leverage panoramic RGB and depth inputs to provide rich spatial cues for action planning, but these sensors can be costly or less accessible in real-world deployments. Recent approaches based on Vision-Language Action (VLA) models achieve strong results with monocular input, yet they still lag behind methods using panoramic RGB-D information. We present MonoDream, a lightweight VLA framework that enables monocular agents to learn a Unified Navigation Representation (UNR). This shared feature representation jointly aligns navigation-relevant visual semantics (e.g., global layout, depth, and future cues) and language-grounded action intent, enabling more reliable action prediction. MonoDream further introduces Latent Panoramic Dreaming (LPD) tasks to supervise the UNR, which train the model to predict latent features of panoramic RGB and depth observations at both current and future steps based on only monocular input. Experiments on multiple VLN benchmarks show that MonoDream consistently improves monocular navigation performance and significantly narrows the gap with panoramic-based agents.


Learning to Predict Aboveground Biomass from RGB Images with 3D Synthetic Scenes

Zuffi, Silvia

arXiv.org Artificial Intelligence

Forests play a critical role in global ecosystems by supporting biodiversity and mitigating climate change via carbon sequestration. Accurate aboveground biomass (AGB) estimation is essential for assessing carbon storage and wildfire fuel loads, yet traditional methods rely on labor-intensive field measurements or remote sensing approaches with significant limitations in dense vegetation. In this work, we propose a novel learning-based method for estimating AGB from a single ground-based RGB image. We frame this as a dense prediction task, introducing AGB density maps, where each pixel represents tree biomass normalized by the plot area and each tree's image area. We leverage the recently introduced synthetic 3D SPREAD dataset, which provides realistic forest scenes with per-image tree attributes (height, trunk and canopy diameter) and instance segmentation masks. Using these assets, we compute AGB via allometric equations and train a model to predict AGB density maps, integrating them to recover the AGB estimate for the captured scene. Our approach achieves a median AGB estimation error of 1.22 kg/m^2 on held-out SPREAD data and 1.94 kg/m^2 on a real-image dataset. To our knowledge, this is the first method to estimate aboveground biomass directly from a single RGB image, opening up the possibility for a scalable, interpretable, and cost-effective solution for forest monitoring, while also enabling broader participation through citizen science initiatives.
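The density-map formulation can be sketched concretely: each tree's biomass (obtained from an allometric equation) is spread over its mask pixels, normalized by the plot area and the tree's pixel count, so integrating the map recovers total AGB per unit area. The mask layout and numbers below are illustrative, not from the SPREAD dataset.

```python
import numpy as np

def agb_density_map(masks, biomasses, plot_area):
    # masks: list of (H, W) boolean instance masks, one per tree
    # biomasses: per-tree AGB in kg (e.g., from an allometric equation)
    # plot_area: plot area in m^2
    h, w = masks[0].shape
    dmap = np.zeros((h, w))
    for mask, agb in zip(masks, biomasses):
        npix = mask.sum()
        # Normalize by plot area and by the tree's image area, so summing the
        # map over the tree's pixels yields agb / plot_area.
        dmap[mask] += agb / (plot_area * npix)
    return dmap

# Toy scene: two trees on a 4x4 image, 100 m^2 plot
m1 = np.zeros((4, 4), dtype=bool); m1[:2, :2] = True
m2 = np.zeros((4, 4), dtype=bool); m2[2:, 2:] = True
dmap = agb_density_map([m1, m2], [120.0, 80.0], plot_area=100.0)
print(dmap.sum())  # 2.0 -> (120 + 80) kg / 100 m^2
```

Summing (integrating) the predicted density map over the whole image thus yields the scene's AGB estimate in kg/m^2, which is what the model is trained to reproduce.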